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W ' Abstract 

^^ , In this paper we develop a randomized block-coordinate descent method for minimizing the 

i-G ' sum of a smooth and a simple nonsmooth block-separable convex function and prove that it 

C^ I obtains an e-accurate solution with probability at least 1 — p in at most 0(- log -) iterations, 

where n is the number of blocks. For strongly convex functions the method converges linearly. 
This extends recent results of Nesterov [Efhciency of coordinate descent methods on huge-scale 
optimization problems, CORE Discussion Paper 7^2010/2], which cover the smooth case, to 
composite minimization, while at the same time improving the complexity by the factor of 4 and 
QQ ' removing e from the logarithmic term. More importantly, in contrast with the aforementioned 

^^ , work in which the author achieves the results by applying the method to a regularized version 

OO ' of the objective function with an unknown scaling factor, we show that this is not necessary, 

f^ I thus achieving true iteration complexity bounds. In the smooth case we also allow for arbitrary 

t~^ ' probability vectors and non-Euclidean norms. Finally, we demonstrate numerically that the 

^3 I algorithm is able to solve huge-scale £i-regularized least squares and support vector machine 

problems with a billion variables. 

Keywords: Block coordinate descent, iteration complexity, composite minimization, coor- 
^^ , dinate relaxation, alternating minimization, convex optimization, Ll-regularization, large scale 

$H ' support vector machines. 

1 Introduction 

The goal of this paper, in the broadest sense, is to develop efhcient methods for solving structured 
convex optimization problems with some or all of these (not necessarily distinct) properties: 
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1. Size of Data. The size of the problem, measured as the dimension of the variable of interest, 
is so large that the computation of a single function value or gradient is prohibitive. There 
are several situations in which this is the case, let us mention two of them. 

• Memory. If the dimension of the space of variables is larger than the available memory, 
the task of forming a gradient or even of evaluating the function value may be impossible 
to execute and hence the usual gradient methods will not work. 

• Patience. Even if the memory does not preclude the possibility of taking a gradient step, 
for large enough problems this step will take considerable time and, in some applications 
such as image processing, users might prefer to see/have some intermediary results before 
a single iteration is over. 

2. Nature of Data. The nature and structure of data describing the problem may be an obstacle 
in using current methods for various reasons, including the following. 

• Completeness. If the data describing the problem is not immediately available in its 
entirety, but instead arrives incomplete in pieces and blocks over time, with each block 
"corresponding to" one variable, it may not be realistic (for various reasons such as 
"memory" and "patience" described above) to wait for the entire data set to arrive before 
the optimization process is started. 

• Source. If the data is distributed on a network not all nodes of which are equally 
responsive or functioning, it may be necessary to work with whatever data is available 
at a given time. 

It appears that a very reasonable approach to solving some problems characterized above is to 
use (block) coordinate descent methods (CD). In the remainder of this section we mix arguments in 
support of this claim with a brief review of the relevant literature and an outline of our contributions. 

1.1 Block Coordinate Descent Methods 

The basic algorithmic strategy of CD methods is known in the literature under various names such 
as alternating minimization, coordinate relaxation, linear and non-linear Gauss-Seidel methods, 
subspace correction and domain decomposition. As working with all the variables of an optimization 
problem at each iteration may be inconvenient, difficult or impossible for any or all of the reasons 
mentioned above, the variables are partitioned into manageable blocks, with each iteration focused 
on updating a single block only, the remaining blocks being fixed. Both for their conceptual and 
algorithmic simplicity, CD methods were among the first optimization approaches proposed and 
studied in the literature (see [l] and the references therein; for a survey of block CD methods in 
semidefinite programming we refer the reader to |24|). While they seem to have never belonged 
to the mainstream focus of the optimization community, a renewed interest in CD methods was 
sparked recently by their successful application in several areas — training support vector machines 
in machine learning El [HI [28l [29] , optimization |9l[23l[2ll[20l[3ll[H[l3l[2S], compressed sensing 
[8], regression |27| . protein loop closure [2] and truss topology design |15| — partly due to a change 
in the size and nature of data described above. 



Order of coordinates. Efficiency of a CD method will necessarily depend on the balance between 
time spent on choosing the block to be updated in the current iteration and the quality of this choice 
in terms of function value decrease. One extreme possibility is a greedy strategy in which the block 
with the largest descent or guaranteed descent is chosen. In our setup such a strategy is prohibitive 
as i) it would require all data to be available and ii) the work involved would be excessive due to 
the size of the problem. Even if one is able to compute all partial derivatives, it seems better to 
then take a full gradient step instead of a coordinate one, and avoid throwing almost all of the 
computed information away. On the other end of the spectrum are two very cheap strategies for 
choosing the incumbent coordinate: cyclic and random. Surprisingly, it appears that complexity 
analysis of a cyclic CD method in satisfying generality has not yet been done. The only attempt 
known to us is the work of Saha and Tewari |16| : the authors consider the case of minimizing a 
smooth convex function and proceed by establishing a sequence of comparison theorems between 
the iterates of their method and the iterates of a simple gradient method. Their result requires an 
isotonicity assumption. Note that a cyclic strategy assumes that the data describing the next block 
is available when needed which may not always be realistic. The situation with a random strategy 
seems better; here are some of the reasons: 

(i) Recent efforts suggest that complexity results are perhaps more readily obtained for random- 
ized methods and that randomization can actually improve the convergence rate |18| IHlfTT]. 

(ii) Choosing all blocks with equal probabilities should, intuitively, lead to similar results as is the 
case with a cyclic strategy. In fact, a randomized strategy is able to avoid worst-case order of 
coordinates, and hence might be preferable. 

(iii) Randomized choice seems more suitable in cases when not all data is available at all times. 

(iv) One may study the possibility of choosing blocks with different probabilities (we do this in 
Section |4j. The goal of such a strategy may be either to improve the speed of the method (in 
Section [6. II we introduce a speedup heuristic based on adaptively changing the probabilities), 
or a more realistic modeling of the availability frequencies of the data defining each block. 

Step size. Once a coordinate (or a block of coordinates) is chosen to be updated in the current 
iteration, partial derivative can be used to drive the steplength in the same way as it is done in the 
usual gradient methods. As it is sometimes the case that the computation of a partial derivative is 
much cheaper and less memory demanding than the computation of the entire gradient, CD methods 
seem to be promising candidates for problems described above. It is important that line search, if 
any is implemented, is very efficient. The entire data set is either huge or not available and hence 
it is not reasonable to use function values at any point in the algorithm, including the line search. 
Instead, cheap partial derivative and other information derived from the problem structure should 
be used to drive such a method. 

1.2 Problem Description and Our Contribution 

The problem. In this paper we study the iteration complexity of simple randomized block coor- 
dinate decent methods applied to the problem of minimizing a composite objective function, i.e., a 
function formed as the sum of a smooth convex and a simple nonsmooth convex term: 

min F{x)=^f{x) + ^{x). (1) 



We assume that this problem has a minimum {F* > — oo), / has (block) coordinate Lipschitz 
gradient, and ^ is a (block) separable proper closed convex extended real valued function (these 
properties will be defined precisely in Section [2]). Possible choices of ^ include: 

(i) ^ = 0. This covers the case of smooth minimization. Complexity results are given in |13| . 

(ii) ^ is the indicator function of a block-separable convex set (such as a box). This choice models 
problem,s with constraints on blocks of variables; iteration complexity results are given in |13| . 



(iii) ^(x) = A||2;||i for A > 0. In this case we can decompose R onto N blocks. Increasing 
A encourages the solution of ([1]) to be sparser |26j . Applications abound in, for instance, 
machine learning [3], statistics [19] and signal processing [8]. 

(iv) There are many more choices such as the elastic net |32| , group lasso |301 \W\ [T^ and sparse 
group lasso [4]. 

Iteration complexity results. Strohmer and Vershynin |18) have recently proposed a random- 
ized Karczmarz method for solving overdetermined consistent systems of linear equations and proved 
that the method enjoys global linear convergence whose rate can be expressed in terms of the condi- 
tion number of the underlying matrix. The authors claim that for certain problems their approach 
can be more efficient than the conjugate gradient method. Motivated by these results, Leventhal 
and Lewis [6] studied the problem of solving a system of linear equations and inequalities and in the 
process gave iteration complexity bounds for a randomized CD method applied to the problem of 
minimizing a convex quadratic function. In their method the probability of choice of each coordinate 
is proportional to the corresponding diagonal element of the underlying positive semidefinite matrix 
defining the objective function. These diagonal elements can be interpreted as Lipschitz constants 
of the derivative of a restriction of the quadratic objective onto one-dimensional lines parallel to the 
coordinate axes. In the general (as opposed to quadratic) case considered in this paper ([T]), these 
Lipschitz constants will play an important role as well. Lin et al. [3] derived iteration complex- 
ity results for several smooth objective functions appearing in machine learning. Shalev-Schwarz 
and Tewari |17) proposed a randomized coordinate descent method with uniform probabilities for 
minimizing ^i-regularized smooth convex problems. They first transform the problem into a box 
constrained smooth problem by doubling the dimension and then apply a coordinate gradient de- 
scent method in which each coordinate is chosen with equal probability. Nesterov |13| has recently 
analyzed randomized coordinate descent methods in the smooth unconstrained and box-constrained 
setting, in effect extending and improving upon some of the results in [6l [3l [17] in several ways. 

While the asymptotic convergence rates of some variants of CD methods are well understood 
m [23l [21] [20l [31], iteration complexity results are very rare. To the best of our knowledge, ran- 
domized CD algorithms for minimizing a composite function have been proposed and analyzed (in 
the iteration complexity sense) in a few special cases only: a) the unconstrained convex quadratic 
case [6], b) the smooth unconstrained (^ = 0) and the smooth block-constrained case (\I' is the 
indicator function of a direct sum of boxes) [13] and c) the ^i-regularized case [17] . As the approach 
in |17| is to rewrite the problem into a smooth box-constrained format first, the results of |13| can 
be viewed as a (major) generalization and improvement of those in [17] (the results were obtained 
independently) . 



Contribution. In this paper we further improve upon and extend and simplify the iteration 
complexity results of Nesterov |13| , treating the problem of minimizing the sum of a smooth convex 
and a simple nonsmooth convex block separable function ([1]). We focus exclusively on simple 
(as opposed to accelerated) methods. The reason for this is that the per-iteration work of the 
accelerated algorithm in [13] on huge scale instances of problems with sparse data (such as the 
Google problem where sparsity corresponds to each website linking only to a few other websites 
or the sparse problems we consider in Section [6]) is excessive. In fact, even the author does not 
recommend using the accelerated method for solving such problems; the simple methods seem to 
be more efficient. 

Each algorithm of this paper is supported by a high probability iteration complexity result. That 
is, for any given confidence level < p < 1 and error tolerance e > 0, we give an explicit expression 
for the number of iterations k which guarantee that the method produces a random iterate Xk for 
which 

P{F{xk) - F* < e) > I - p. 

Table [T] summarizes the main complexity results of this paper. Algorithm [2] — Uniform (block) 
Coordinate Descent for Composite functions (UCDC) — is a method where at each iteration the 
block of coordinates to be updated (out of a total of n < iV blocks) is chosen uniformly at random. 
Algorithm [3] — Randomized (block) Coordinate Descent for Smooth functions (RCDS) — is a method 
where at each iteration block i G {1, . . . ,n} is chosen with probability pi. Both of these methods 
are special cases of the generic Algorithm [U Randomized (block) Coordinate Descent for Composite 
functions (RCDC). 



Algorithm 
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Complexity 



Algorithm m (UCDC) 
(Theorem \^ 

Algorithm [2] (UCDC) 
(Theorem O 

Algorithm E] (RCDS) 
(Theorem [TT]| 

Algorithm |3] (RCDS) 
(Theorem [III 



composite 

strongly convex 
composite 

convex 
smooth 

strongly convex 
smooth 



2nra!„c{'R.i{xo),Fixo)-F'} ^ 



2n-R.i(xo) , f F(xo)~F 



l + log^) 



log 



(£1,^) 



inax|-, — ^jnloe 



F(xo)^F' 



27^fp-l(^o) 



(l+logi)-2 



1 log ' f^^'il-f 



Table 1: Summary of complexity results obtained in this paper. 



The symbols P, L,7^^(a;o) and /n appearing in Table [T] will be defined precisely in further 
sections. For now it suffices to say that L encodes the (block) coordinate Lipschitz constants 
of the gradient of /, P encodes the probabilities {pi}, ^Z^r{xQ) is a measure of distance of the initial 
iterate xq from the set of minimizers of the problem ([1]) in a norm defined by W (see Section [2]) and 
p is the strong convexity parameter of F (see Section 13. 2p . In the nonsmooth case p depends on L 
and the smooth case it depends both on L and P. 

Let us now briefly outline the main similarities and differences between our results and those in 
[13| . A more detailed and expanded discussion can be found in Section [5l 



1. Composite setting. We consider the composite setting ([T]), whereas |13| covers tlie uncon- 
strained and constrained smooth setting only. 

2. No need for regularization. Nesterov's high probability results in the case of minimizing 
a function which is not strongly convex are based on regularizing the objective to make it 
strongly convex and then running the method on the regularized function. Our contribution 
here is that we show that no regularization is needed by doing a more detailed analysis using 
a thresholding argument (Theorem [T]) . 

3. Better complexity. Our complexity results are better by the constant factor of 4. Also, we 
have removed e from under the logarithm. 

4. General probabilities. Nesterov considers probabilities pi proportional to Lf, where a > 
is a parameter. High probability results are proved in |13| for a G {0, 1} only. Our results in 
the smooth case hold for an arbitrary probability vector p. 

5. General norms. Nesterov's expectation results (Theorems 1 and 2) are proved for general 
norms. However, his high probability results are proved for Euclidean norms only. In our 
approach all results hold for general norms. 

6. Simplification. Our analysis is more compact. 

In the numerical experiments section we focus on sparse ^i-regularized regression and support 
vector machine problems. For these problems we introduce a powerful speedup heuristic based 
on adaptively changing the probability vector throughout the iterations (Section 16. li "speedup by 
shrinking"). 

Contents. This paper is organized as follows. We start in Section [2] by defining basic notation, 
describing the block structure of the problem, stating assumptions and describing the generic ran- 
domized block-coordinate descent algorithm (RCDC). In Section [3] we study the performance of a 
uniform variant (UCDC) of RCDC as applied to a composite objective function and in Section 3] 
we analyze a smooth variant (RCDS) of RCDC; that is, we study the performance of RCDC on 
a smooth objective function. In Section [5] we compare known complexity results for CD methods 
with the ones established in this paper. Finally, in Section [6] we demonstrate the efficiency of the 
method on £i-regularized sparse regression and linear support vector machine problems. 

2 Assumptions and the Algorithm 

Block structure. We model the block structure of the problem by decomposing the space R 
into n subspaces as follows. Let U € R^^^ be a column permutation of the N x N identity 
matrix and further let U = [Ui, U2, ■ ■ ■ , Un] be a decomposition of U into n submatrices, with f/j 
being of size N x Ni, where X^j A'j = N. Clearly, any vector x S R can be written uniquely as 
X = Y2i UiX^^\ where x^*) = U^x G Rj = R^\ Also note that 

rr \ Nj X N-i identity matrix, if i = 7, 

I A'j X Nj zero matrix, otherwise. 



For simplicity we will write x = {x^^', . . . , x^^'Y' . We equip Rj with a pair of conjugate Euclidean 
norms: 

||i|l(,) = (i?,t,i)i/2, \\t\\l,^ = {Br\t)^l\ teRi, (3) 

where Bi € R^»^^' is a positive definite matrix and (•, •) is the standard Euclidean inner product. 

Example 1. Let n = N , Ni = 1 for all i and U = [ei,e2, • • • ,e.„] be the n x n identity matrix. 
Then Ui = Cj is the i-th unit vector and x^^> = efx € Rj = R is the i-th coordinate of x. Also, 
X = ^^CiX*-*'. If we let Bi = 1 for alii, then \\t\\u\ = \\i\\%) = \t\ for all t G R. 

Smoothness of /. We assume throughout the paper that the gradient of / is block coordinate- 
wise Lipschitz, uniformly in x, with positive constants Li, . . . ,Ln, i.e., that for all x G R , t G Rj 
and i we have 

\\Vif{x + U,t) - Vi/(x)||^,) < L,||t||(,), (4) 

where 

Vd{x) ''^' (V/(x))» = C/f V/(x) G R.. (5) 

An important consequence of (j4]) is the following standard inequality |11| : 

fix + Uit) < fix) + (V,/(x),i) + ^\\t\\ly (6) 

Separability of ^. We assume that ^ is block separable, i.e., that it can be decomposed as 

follows: 

n 

^(x) = 5^^i(x«), (7) 

1=1 

where the functions ^j : Rj — ?> R are convex and closed. 

The algorithm. Notice that an upper bound on F(x + Uit), viewed as a function of t G Ri, is 
readily available: 

F{x + Uit)^ f{x + Uit) + ^ix + U,t) < f{x) + V,ix,t) + Qix), (8) 



where 
and 



Viix,t) "^^ {Vifix),t) + ^||i||2^^ +M/,(x« +t) (9) 



C.(rr)"=^'J]*,(x(^)). (10) 

We are now ready to describe the generic method. Given iterate xj., Algorithm [1] picks block 
ik = i ^ {1;2, . . . ,n} with probability pj > and then updates the i-th block of x^ so as to 
minimize (exactly) in t the upper bound ([S]) on Fix^ + Uit). Note that in certain cases it is possible 
to minimize Fix^ + Uit) directly; perhaps in a closed form. This is the case, for example, when / 
is a convex quadratic. 

The iterates {x^} are random vectors and the values {F(xfc)} are random variables. Clearly, 
Xk+i depends only on x^. As our analysis will be based on the (expected) per- iteration decrease of 
the objective function, the results will hold even if we replace Viix^^t) by F(x^. + C/jt) in Algorithm[TJ 
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Algorithm 1 RCDC(p, xq) (Randomized Coordinate Descent for Composite Functions) 
for k = 0,1,2,... do 

Choose ifc = i G {1, 2, . . . , n} with probability pi 

T«(xfc) = argmm{Viixk,t) : t G R,} 
end for 



Global structure. For fixed positive scalars wi, . . . , Wn let W = T)iag{wi, . . . , Wn) and define a 
pair of conjugate norms in R by 



\x\\w 



2_^UJi\\JL ||(j) 



.1=1 



1/2 



12/IIh/ = max (y,x) 



,i=l 



1/2 



fill 



(12) 



In the the subsequent analysis we will use W = L (Section [3]) and W = LP ^ (Section U]), where 
L = Diag(Li, ...,Ln) and P = Diag(pi, . . . ,p„). 

The set of optimal solutions of ([T|) is denoted by X* and x* is any element of that set. Define 

7^Vk(3;) = max max {lly — x*||vy : F{y)<F{x)}, 

y x*ex* 

which is a measure of the size of the level set of F given by x. In most of the results in this paper 
we will need to assume that 7^14^(2^0) is finite for the initial iterate xq and W = L or W = LP^^. 

A technical result. The next simple result is the main technical tool enabling us to simplify and 
improve the corresponding analysis in |13) . It will be used with ^^ = F{xk) — F* . 

Theorem 1. Let ^q > be a constant, < e < ^o> ^iT-d consider a nonnegative nonincreasing 
sequence of (discrete) random variables {£,k}k>o with one of the following properties: 

(i) E[^fc_|_i I ^fc] < ^k — ;r, for all k, where c > is a constant, 
(ii) E[^fc+i I ^fc] < (1 — -)ik, for all k such that ^k ^ £> where c > 1 is a constant. 
Choose confidence level p G (0, 1). If property (i) holds and we choose e < c and 

i^>f(l + logi) + 2-|^, 
or if property (ii) holds, and we choose 



then 



K>c\og%, 



^{^K < e) > 1 



(13) 

(14) 
(15) 



Proof. Notice that the sequence {'^^}fc>o defined by 

1 otherwise, 
satisfies 

ek<e ^ 6<e, k>0. (16) 

Therefore, by Markov inequafity, 

P(efc>e) = P(a>e)<™, 



and hence it suffices to show that 

9k < ep, (17) 

def 

where 9^ = E[^^]. If property (i) holds, then 

E[a+iia]<a-¥, na+i\a]<i^-mi k>o, (is) 

and by taking expectations (using convexity of t i— > t^ in the first case) we obtain 

0k+i < Ok-'4, k>0, (19) 

Ok+i < (l-f)^fc, A:>0. (20) 

Notice that (J19p is better than (I20p precisely when 9k > £■ Since 

1 1 Sfe~Sfc+i ^ ^fc"^fc+i \ 1 



9k+i dk 9k+i9k - el - c 

we have ^>^ + ^ = T- + f- Therefore, if we let fci > ^ — ^, we obtain 9^^ < e. Finally, letting 
k2 > J log - , we have 

^x < ^.,+fe < (l-f)'^^., <((l-f)^)'=^°'^6<(e-c)^'°*^Pe = 6p, 
establishing (|17p . If property (ii) holds, then E[^L ^^ | Ck\ — (1 ~ c)^k ^"-"^ ^^^ ^' ^'^'-^ hence 

^i. < (1 - IfOo = (1 - i)^eo f ((1 - Irf'^^^Co < {e-'f'i^o = ep, 
again establishing (fT7|) . D 

Restarting. Note that similar, albeit slightly weaker, high probability results can be achieved by 
restarting as follows. We run the random process {^k) repeatedly r = [log -] times, always starting 
from ^0) each time for the same number of iterations ki for which P(^fci >()<-. It then follows 
that the probability that all r values ^/fcj will be larger than e is at most (-)'' < p. Note that the 
restarting technique demands that we perform r evaluations of the objective function; this is not 
needed in the one-shot approach covered by the theorem. 
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It remains to estimate ki in the two cases of Theorem [TJ We argue that in case (i) we can choose 
^1 = r^ ~ F"l- Indeed, using similar arguments as in Theorem [T] this leads to E[^fc^] < |, which 
by Markov inequality implies that in a single run of the process we have 

Pfe, > e) < ^ < ^ = i. 

Therefore, 

K=\f-i] [log i] 

iterations suffice in case (i). A similar restarting technique can be applied in case (ii). 
Tightness. It can be shown on simple examples that the bounds in the above result are tight. 

3 Coordinate Descent for Composite Functions 

In this section we study the performance of Algorithm [1] in the special case when all probabilities 
are chosen to be the same, i.e., Pi = - for all i. For easier future reference we set this method apart 
and give it a name (Algorithm [2]) . 

Algorithm 2 UCDC(xo) (Uniform Coordinate Descent for Composite Functions) 
for /c = 0,1,2,... do 

Choose ifc = i G {1, 2, . . . , ?i} with probability - 
T'^'^Xk) = avgmin{Vi{xk,t) : t G Ri} 

xfc+i = xfc + c/irW(xfc) 

end for 

The following function plays a central role in our analysis: 

H{x,T) ""^'fix) + (V/(x),r) + i||r|[i + ^(x + T). (21) 

Comparing 1^ with ([9]) using ([2]), ([5]), ([7]) and ([lU we get 

n 

ij(x,T) = /(x) + ^y,(x,r«). (22) 

Therefore, the vector T(x) = (T'^'(x), . . . ,T("'(x)), with the components T^^'[x) defined in Algo- 
rithm [H is the minimizer of H{x, •): 

r(x) = arg min H{x,T). (23) 

Let us start by establishing an auxiliary result which will be used repeatedly. 
Lemma 2. Let {x^}, k >0, be the random iterates generated by UCDC\xq). Then 

B[F{xk+i) - F* I xfc] < i {H{xk,T{xk)) - F*) + ^ (F(xfc) - F*). (24) 
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Proof. 

n 
i=l 

< i Y^^fi^k) + V-(xfc,T«(xfc)) + a(xfc)] 

n 



02} 



71 






D 



3.1 Convex Objective 



In order for Lemma [2] to be useful, we need to estimate H(xk,T(xk)) — F* from above in terms of 
Fixk) - F*. 

Lemma 3. Fix x* G X* , x G dom^ and let R = \\x - x*\\l. Then 

'l - ^^^^) {F{x) - F*), if Fix) -F*<R^ 
^ gi?^ < ^(F(x) - F*), otherwise. 

Proof. 



Hix,T{x))-F*<^Y 1^. {"^ ^" "^^"' ^ ""' (25) 



H(x,T(x)) = min H{x,T) 

= min H(x,y — x) 

< min /(x) + (V/(x),y-x) + $(y) + i||y-x||i 

j/GR^ 

< mm F(y) + 7^\\y — x\\t 

- yGR^ ^^ 2" "^ 

< min F(ax* + (1 — a)x) + ^||x — x*||x, 

aG[0,l] 

< min F(x) - a(F(x) - F*) + ^i?^^ (26) 

aG[0,l] 

Minimizing ()26p in a gives a* = min |l, (-F(x) — -F*)/i?^|; the result follows. D 

We are now ready to estimate the number of iterations needed to push the objective value within 
e of the optimal value with high probability. Note that since p appears under the logarithm and 
hence it is easy to attain high confidence. 

Theorem 4. Choose initial point xq and target confidence < p < 1. Further, let the target 
accuracy e > and iteration counter k be chosen in any of the following two ways: 
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(i) e < F{xo) - F* and 

,__ ^ 2nmax{7^|(xo),F(xo)-F'^} /, , ,^^i\ , ^ 2nmax{7^|(xo),F(xo)-F*} , . 

K> 1 (^-"^ + ^og p j + ^ F{xo)-F* ' y^'i 

(a) e < min{7?.|(xo),-F(xo) - F*} and 

,^^ 2n1Zli.o) ^^^ Fi.o)-F' _ (28) 

If Xk is the random point generated by UCDC{xq) as applied to the convex function F, then 

P{F{xk)-F* <e)>l-p. 

Proof. Since F{xk) < F{xo) for all k, we have ||xfc — x*\\l < 1Zl{xq) for all x* € X* . Lemma [2] 
together with Lemma [3] then imply that the following holds for all k: 

E[F(xfc+i)-F*|xfc] < imax{l-|Xjl^,l}(F(xfc)-F*) + i^(F(xfc)-F*) 

^ ™{l-S^'l-2k}(^(-^)-n. (29) 

Let ^k = F{xk) — F* and consider case (i). If we let c = 2nu\.Siyi{lZ\{xQ),F{xQ) — -F*}, then from 
(f29l) we obtain 

E[a+i I a] < (1 - %)6 = 6 - I, A; > 0. 

Moreover, e < ^o < c. The result then follows by applying Theorem [TJ Consider now case (ii). 

Letting c = — -^^^ — — > 1, notice that if ^fc > e, inequality (I29p implies that 



.k- 



mk+i I Cd < max{l - ^^^^1(^,1 - 2^}^^ = (1 - i)6 
Again, the result follows from Theorem [TJ D 

3.2 Strongly Convex Objective 

Assume that F is strongly convex with respect to some norm || • || with convexity parameter // > 0; 
that is, 

F{x)>F{y) + {F'{y),x-y) + !^\\x-yf, x,yedomF, (30) 

where F'{y) is any subgradient of F at y. Note that from the first order optimality conditions for 
([T|) we obtain {F'{x*),x — x*) > for all x S domF which, combining with (j30p used with y = x*, 
yields the standard inequality 

F(x)-F* > f||x-x*f, xedomF. (31) 

The next lemma will be useful in proving linear convergence of the expected value of the objective 
function to the minimum. 
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Lemma 5. If F is strongly convex with respect to \\ ■ \\l with convexity parameter ^ > 0, then 

H{x,T{x))-F* <-ff,{F{x)-F*), xedouiF, (32) 



where 



Proof. 



7/. = < 1 ,, . (33) 

-, otherwise. 



H{x,T{x)) = min H{x,t) 

= mill H(x,ti — x) 

< min F(y) + -^Wy — x\\r 

< mill F iax* + il — a)x) + %-\\x — X* \\ T 

aG[0,l] ' J I u 

< min F{x) - a{F{x) - F*) + ^H^; - x*\\l 

aG[0,l] 

eg , . 

< mm F{x) + a(^-l]{F{x)-F*). (34) 

The optimal a in ([M|) is a* = min {l, f }; the result follows. D 

We now show that the expected value of F{xk) converges to F* linearly. 

Theorem 6. Let F he strongly convex with respect to the norm \\ ■ \\l with convexity parameter 
H > 0. If Xk is the random point generated UCDC{xq), then 

E[F(x,) - F*] < (l - IzIe)' ^f{x,) - F*), (35) 

where 7^ is defined by LS3\) . 

Proof. Follows from Lemma [2] and Lemma [5j D 

The following is an analogue of Theorem 2] in the case of a strongly convex objective. Note that 
both the accuracy and confidence parameters appear under the logarithm. 

Theorem 7. Let F be strongly convex with respect to \\ ■ \\l with convexity parameter // > and 
choose accuracy level e > 0, confidence level < p < 1, and 

fc>T^log(^^^^), (36) 

where 7^ is given by (j33p . If x/^. is the random point generated by UCDC{xq), then 

V{F{xk)-F* <e)>l-p. 
Proof. Using Markov inequality and Theorem El we obtain 



P[F(xfc)-F*>e]<iE[F(xfc)-F*] < \(l-^) {F{xo) - F*) < p. 



D 
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F{x) <Ff,{x), xen^. (38) 

def 



3.3 A Regularization Technique 

In this part we will investigate an alternative approach to establishing an iteration complexity result 
in the case of an objective function that is not strongly convex. The strategy is very simple. We first 
regularize the objective function by adding a small quadratic term to it, thus making it strongly 
convex, and then argue that when Algorithm [2] is applied to the regularized objective, we can recover 
an approximate solution of the original non-regularized problem. 

The result obtained in this way is slightly different to the one covered by Theorem H] in that 
2nTZ'j^{xQ) is replaced by 4n||xo — 2;*||i- In some situations, ||xo — x*\\'j^ can be significantly smaller 
than TZ'J^{xq). However, let us remark that the regularizing term depends on quantities that are not 
known in advance. 

Fix xq and e > and consider a regularized version of the objective function defined by 

F^ix)''^'F{x) + !l\\x-xo\\l f, = j^^^^^. (37) 

Clearly, F^ is strongly convex with respect to the norm || • \\l with convexity parameter fi. In the 
rest of this subsection we show that if we apply UCDC(xo) to Ffj_ with target accuracy |, then with 
high probability we recover an e-approximate solution of ([1]). We first need to establish that an 
approximate minimizer of F^ must be an approximate minimizer of F. 

Lemma 8. If x' satisfies F^{x') < min^-gj^Ar -F^(a;) + |, then F{x') < F* + e. 

Proof. Clearly, 

rf^\ <r J, 

If we let X* =' argmin^gj^iv F^(x), then by assumption, 

F,{x') - F^{x;) < ^, (39) 

and 

1371 

F^ix^) = min F{x) + f ||x - xo||i < F{x*) + f ||x* - xo\\l < F{x*) + f . (40) 

Putting all these observations together, we get 

Oil ESJ & 

< F{x') - F{x*) < F^{x')-F{x*) < F^(x*) + f - F(x*) < e. 

D 

The following theorem is an analogue of Theorem 2] 

Theorem 9. Choose initial point xq, target accuracy 

0<e< 2||xo-x*||i, (41) 

target confidence level < /3 < 1, and 

If Xk is the random point generated by UCDC{xq) as applied to F^, then 

P(F(xfc) -F* <e)>l-p. 
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Proof. Let us apply Theorem[7]to the problem of minimizing F^, composed as / + ^^, with ^f^{x) = 
^(x) + ^\\x- xolli- Note that 

Il37l G3 

F^ixo) - F^ix;) y Fixo) - F^ix;) < F(xo) - Fix;) < F(xo) - F*, (43) 



and 



l|33ll,l|33,l|4T]l 4n||xo-x*||2 



Comparing (|36|) and (|4'2|) in view of (|43p and (|44p . Theorem [7] implies that 



(44) 



P{F,{xk)-F,{x;)<^)>l-p. 
It now suffices to apply Lemma [8j D 

4 Coordinate Descent for Smooth Functions 

In this section we give a much simplified and improved treatment of the smooth case (^ = 0) as 
compared to the analysis in Sections 2 and 3 of |13| . 

As alluded to in the above, we will develop the analysis in the smooth case for arbitrary, possibly 
non-Euclidean, norms || • ||(j), i = 1,2, . . . ,n. Let || • || be an arbitrary norm in R'. Then its dual is 
defined in the usual way: 

||s||* = max (s, t). 

The following (Lemma [T0|) is a simple result which is used in |13| without being fully articulated 
nor proved as it constitutes a straightforward extension of a fact that is trivial in the Euclidean 
setting to the case of general norms. Since we think it is perhaps not standard, we believe it deserves 
to be spelled out explicitly. The lemma has the following use. The main problem which needs to be 
solved at each iteration of Algorithm [T] in the smooth case is of the form (I45p . with s = —j-Vif{xk) 



and II ■ II = II • ||{j)- Since || • || is non-Euclidean, we cannot write down the solution of (|45p in a 
closed form a-priori, for all norms. Nevertheless, we can say something about the solution, which 
turns out to be enough for our subsequent analysis. 

Lemma 10. // hy s* we denote an optimal solution of the problem 

mm [u{s) =^-{s,t) + i||tf} , (45) 

then 

n(s#) = -i(||s|^)^ ||s#|| = ||s||*, (as)* = a{s*), aeR. (46) 

Proof. For a = the last statement is trivial. If we fix a 7^ 0, then clearly 



u((as)*) = min mm{-{as, (3t) + i||/?t|P}. 

|it|j = l /3 ^H II J 



For fixed t the solution of the inner problem is /5 = {as, t), whence 

m((qs)#) = min -i(as,t)^ = -iaM max(s,t) j = -i(||as|r)^ (47) 
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proving the first claim. Next, note that optimal t = t* in (j47p maximizes {s,t) over ||t|| = 1 and 
hence {s,t*) = \\s\\* , which implies that 

\\{as)*\\ = 1/3*1 = \{as,f)\ = |a|Ks,t*)| = ia|||s|r = \\as\\* , 

giving the second claim. Finally, since t* depends on s only, we have {as)'"' = I3*t* = {as,t*)t* and, 
in particular, s* = {s,t*)t*. Therefore, (as)* = a{s'^). D 

We can use Lemma fTOl to rewrite the main step of Algorithm [1] in the smooth case into the more 
explicit form, 

T^'-^x) = argmmVi{x,t) = arg min(Vi/(2;), t) + -^||t||(.x 



leading to Algorithm [3j 



Algorithm 3 RCDS(p, xq) (Randomized Coordinate Descent for Smooth Functions) 



for fc = 0,1,2,... do 

Choose ik = i (z {1, 2, . . . , n} with probability pi 
Xfc+i = Xfc - ^Ui{Vif{xk))* 
end for 



The main utility of Lemma [TU] for the purpose of the subsequent complexity analysis comes from 
the fact that it enables us to give an explicit bound on the decrease in the objective function during 
one iteration of the method in the same form as in the Euclidean case: 

f{x) - fix + f/.r«(x)) > -[(v,/(x),T«(x)) + ^||r«(x)||2^)] 

= -LMi-^f) (48) 

^ ¥(l|-^ll(,)^ = i(l|V./(x)||^,)^ 

4.1 Convex Objective 

We are now ready to state the main result of this section. 

Theorem 11. Choose initial point xq, target accuracy < e < min{/(2;o) — /*, 27^^p_i(a;o)}, target 
confidence < p < 1 and 

^> ^^^^ 1+iog^ +2- -r-::;.\ m 



or 

fe>^%i^(l + logi)-2. (50) 

//xfc is the random point generated by RCDS{p,xq) as applied to convex f, then 

F{f{xk) -r<e)>l-p. 
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Proof. Let us first estimate the expected decrease of tlie objective function during one iteration of 
the method: 

n 

f{xk) - E[/(xfc+i) I Xk] = 5^K[/(^fe) - fi^k + UiT('\xk))] 

i=l 
> hT.P^^-i\\'^^fi^k)\\l,)? = ^(I|V/(X,)||^)^ 

where W = LP^^. Since /(x^) < /(xq) for all k and because / is convex, we get f{xk) — /* < 
maxa;.gx*(V/(xfc),Xfc - x*) < \\Vf{xk)\\wTZw{xo), whence 

y 

By rearranging the terms we obtain 

E[/(x,+i) - /* I xu] < f{x,) -r- ^4|g^. 

If we now use Theorem [1] with ^^ = f{xi.) — f* and c = 2TZ^^{xo), we obtain the result for k given 
by (|49p . We now claim that 2 — j- < —2, from which it follows that the result holds for k given by 



/(xfc)-E[/(xfc+i)|xfc]>i'^(^i^)^ 



()50p . Indeed, first notice that this inequality is equivalent to 

f{xo)-r<kK^{xo). (51) 

Now, a straightforward extension of Lemma 2 in |13| to general weights states that V/ is Lipschitz 
with respect to the norm || • \\v with the constant tv{LV~^). This, in turn, implies the inequality 

f{x)-r<ltT{LV~')\\x-X*fy, 

from which (j5ip follows by setting V = W and x = xq. D 



4.2 Strongly Convex Objective 

Assume now that / is strongly convex with respect to the norm || • ||ip-i (see definition (|30p ) with 
convexity parameter fi > 0. Using pop with x = x* and y = xj., we obtain 



r - f{xk) > (v/(xfc), h) + ^\\hhp-i = fi (avfixk),h) + mmlp 



where h = x* — x^. Applying Lemma [10] to estimate the right hand side of the above inequality 
from below we obtain 

r-/(x,)>-2^(||V/(x,)||2p_0'- (52) 

Let us now write down an efficiency estimate for the case of a strongly convex objective. 

Theorem 12. Choose initial point xq, target accuracy < e < /(xq) — /*, target confidence 
< p < 1 and 



fc>^logii2^. (53) 



1 I f{xo)-f 

If Xk is the random point generated by RCDS{p,X()) as applied to f , then 

P(/(x,) - r < e) > 1 - p. 
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Proof. Let us first estimate the expected decrease of tlie objective function during one iteration of 
the method: 



f{xk) - Wi^k+i) I Xk] = 5^Pi[/(^fc) - f{xk + UiT^'\xk))] 

i=l 

m 



> lEp^^^^\\'^^fi^k)\\Uf 

i=l 

= ^(l|V/(xfc)|IIp-0' 

l(52)l 

> f^{f{xk)-n- 

After rearranging the terms we obtain E[/(xfc_|_i) — /* | x^] < (1 — /-f)E[/(x/fc) — /*]. If we now use 



part (ii) of Theorem [T] with ^^ = f{xk) — f* and c = -, we obtain the result. 



D 



5 Comparison of CD Methods with Complexity Guarantees 

In this section we compare the results obtained in this paper with existing CD methods endowed 
with iteration complexity bounds. 



5.1 Smooth case (\1/ = 0) 

In Table [2] we look at the results for unconstrained smooth minimization of Nesterov |13| and 
contrast these with our approach. For brevity we only include results for the non-strongly convex 
case. 



Algorithm 



* 



P^ 



Norms 



Complexity 



Objective 



Nesterov |13| 
(Theorem 4) 

Nesterov |13| 
(Theorem 3) 

Algorithm [3] 
(Theorem [nil 

Algorithm [1 
(Theorem gl) 



Lj 



E.i> 



>0 



separable 



Euclidean 



Euclidean 



general 



Euclidean 



(2n + 



'T,.L 



L^li!!^) log 4(/(^o)-/-') 



8n1ll(xo) , 4(/(^o)-/') 



^■^-'' °' (l + logl)-2 



2nmax{TC|(3:o),F(3;o)-F*} Q ^ x\ 






m + 11^^ 






F{a 



Table 2: Comparison of our results to the results in |13) in the non-strongly convex case. The 
complexity is for achieving P{F{xi.) — F* < e) > 1 — p. 

We will now comment on the contents of Table [2] in detail. 

• Uniform probabilities. Note that in the uniform case {pi = - for all i) we have 

11\p^^{xq) = nUlixo), 
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and hence the leading term (ignoring the logarithmic factor) in the complexity estimate of 
Theorem [TT] (line 3 of Table [2]) coincides with the leading term in the complexity estimate of 
Theorem m (line 4 of Table [2j the second result): in both cases it is 

£ 

Note that the leading term of the complexity estimate given in Theorem 3 of |13) (line 2 of 
Table [2]), which covers the uniform case, is worse by a factor of 4. 

• Probabilities proportional to Lipschitz constants. If we set pt = Li/ S for all i, where 
S = YjiLi, then 

nlp^,{xQ) = Sn]{xQ). 

In this case Theorem 4 in |13| (line 1 of Table [2)) gives the complexity bound 2[n -\ ' ° ] 

(ignoring the logarithmic factor), whereas we obtain the bound ' ° (line 3 of Table [2|), an 

improvement by a factor of 4. Note that there is a further additive decrease by the constant 

2n {and the additional constant rfx\- r* 2 if we look at the sharper bound (j49l) ). 

• General probabilities. Note that unlike the results in |13| . which cover the choice of two 
probability vectors only (lines 1 and 2 of Table [2]) — uniform and proportional to Li — our result 
(line 3 of Table [2]) covers the case of arbitrary probability vector p. This opens the possibility 
for fine-tuning the choice of p, in certain situations, so as to minimize 7^^p_i(a;o). 

• Logarithmic factor. Note that in our results we have managed to push e out of the logarithm. 

• Norms. Our results hold for general norms. 

• No need for regularization. Our results hold for applying the algorithms to F directly; 
i.e., there is no need to first regularize the function by adding a small quadratic term to it 
(in a similar fashion as we have done it in Section 13. 3p . This is an essential feature as the 
regularization constants are not known and hence the complexity results obtained that way 
are not true complexity results. 

5.2 Nonsmooth case (^ 7^ 0) 

In Table [3] we summarize the main characteristics of known complexity results for coordinate (or 
block coordinate) descent methods for minimizing composite functions. 

Note that the methods of Saha & Tewari and Schwarz & Tewari cover the li regularized case 
only, whereas the other methods cover the general block-separable case. However, while the greedy 
approach of Yun & Tseng requires per-iteration work which grows with increasing problem dimen- 
sion, our randomized strategy can be implemented cheaply. This gives an important advantage to 
randomized methods for problems of large enough size. 

The methods of Yun & Tseng and Saha &; Tewari use one Lipschitz constant only, the Lipschitz 
constant L(V/) of the gradient of /. Note that if n is large, this constant is typically much larger 
than the (block) coordinate constants Lj. Schwarz & Tewari use coordinate Lipschitz constants, 
but assume that all of them are the same. This is suboptimal as in many applications the constants 
{Lj} will have a large variation and hence if one chooses j3 = maxj Li for the common Lipschitz 
constant, steplengths will necessarily be small (see Figure [2] in Section [6|). 
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Algorithm 


Lipschitz 
constant (s) 


* 


block 


Choice of 
coordinate 


Work per 
1 iteration 


Yun & Tseng 
122J 


i(V/) 


separable 


Yes 


greedy 


expensive 


Saha & Tewari 

m 


i(V/) 


Mil 


No 


cychc 


cheap 


Shwartz & Tewari 

m 


/3 = maxi Li 


Mil 


No 


i_ 

n 


cheap 


This paper 
(Algorithm ^ 


U 


separable 


Yes 


1 
n 


cheap 



Table 3: Comparison of CD approaches for minimizing composite functions (for which iteration 
complexity results are provided). 



Let us now compare the impact of the Lipschitz constants on the complexity estimates. For 
simplicity assume N = n and let u = x* — xq. The estimates are listed in Tabled! it is clear from 
the last column that the the approach with individual constants Li for each coordinate gives the 
best complexity. 



Algorithm 



Yun & Tseng 



Saha & Tewari 

m 

Shwartz & Tewari 

m 

This paper 
(Algorithm 0) 



complexity 



o( 



o{ 



nL(V/)||:c*-3:o|| 



nL(Vf)\\x'' -xqW 



f n/3 ||a: — xq || 



o(- 

„/ n||3:*-3:o||| ■ 



complexity (expanded) 



o(7E,i(v/)(^W)') 

0(7E,i(V/)(nW)^) 
O(fE,(max,L0(«''^)') 



Table 4: Comparison of iteration complexities of the methods listed in Table [3j The complexity in 
the case of the randomized methods gives iteration counter k for which E(F(xfc) < e) 



6 Numerical Experiments 

In this section we study the numerical behavior of RCDC on synthetic and real problem instances 
of two problem classes: Sparse Regression / Lasso (Section 16. ip |19| and Linear Support Vector 
Machines (Section 16. 2p . Due to space limitations we will devote a separate report to the study of 
the (Sparse) Group Lasso problem. 



20 



As an important concern in Section [6. II is to demonstrate that our methods scale weh with size, 
all experiments were run on a PC with 480GB RAM. All algorithms were written in C. 

6.1 Sparse Regression / Lasso 

Consider the problem 

min i|Ux - b\\l + X\\x\\i, (54) 

where A = [ai, . . . , a„] G R*"^*^, b € R™, and A > 0. The parameter A is used to induce sparsity in 
the resulting solution. Note that (|54p is of the form ([T]), with f{x) = ^\\Ax — b\\2 and ^(x) = A||x||i. 
Moreover, if we let A^ = n and Ui = Ci for all i, then the Lipschitz constants Li can be computed 

explicitly: 

r — II l|2 

Computation of t = T^'^'{x) reduces to the "soft-thresholding" operator |28| . In some of the exper- 
iments in this section we will allow the probability vector p to change throughout the iterations 
even though we do not give a theoretical justification for this. With this modification, a direct 
specialization of RCDC to (j54p takes the form of Algorithm HI If uniform probabilities are used 
throughout, we refer to the method as UCDC. 

Algorithm 4 RCDC for Sparse Regression 



Choose xq € R" and set qq = Axi^ 
for A; = 0,1,2,... do 

Choose if^ = i G {1, 2, . . . , n} with probability pj^ 



.(i) 



a+A -f „{«) g+A ^ n 

f — i Q-A if ^{*) «-A ^ n. 

<^i\\2 ll^'ilb 






otherwise 



Xk+i = Xk + tei, gk+i = gk + tai 
end for 



Instance generator 

In order to be able to test Algorithm |4] under controlled conditions we use a (variant of the) instance 
generator proposed in Section 6 of |12| (the generator was presented for A = 1 but can be easily 
extended to any A > 0). In it, one chooses the sparsity level of A and the optimal solution x*; after 
that A, 6, X* and F* = F{x*) are generated. For details we refer the reader to the aforementioned 
paper. 

In what follows we use the notation ||A||o and ||2;||o to denote the number of nonzero elements 
of matrix A and of vector x, respectively. 

Speed versus sparsity 

In the first experiment we investigate, on problems of size m = 10^ and n = 10^, the dependence 
of the time it takes for UCDC to complete a block of n iterations (the measurements were done by 
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running the method for 10 x n iterations and then dividing by 10) on the sparsity levels of A and x* . 
Looking at Table [5l we see that the speed of UCDC depends roughly linearly on the sparsity level 
of ^ (and does not depend on ||x*||o at all). Indeed, as ||^||o increases from 10^ through 10^ to 10^, 
the time it takes for the method to complete n iterations increases from about 0.9s through 4-6s to 
about 46 seconds. This is to be expected since the amount of work per iteration of the method in 
which coordinate i is chosen is proportional to ||ai||o (computation of a, HfliHl and gk+i)- 



16 X 10^ 
16 X 10^ 
16 X lO'i 



ll^llo^lO^ 



0.89 
0.85 
0.86 



ll^llo = 10*^ 



5.89 
5.83 

4.28 



ll^llo = 10^ 



46.23 
46.07 
46.93 



Table 5: The time it takes for UCDC to complete a block of n iterations increases linearly with 
\\A\\q and does not depend on |l2;*||o. 



Efficiency on huge-scale problems 

Tables [6] and [7] present typical results of the performance of UCDC, started from xq = 0, on synthetic 
sparse regression instances of big/huge size. The instance in the first table is of size m = 2 x 10^ 
and n = 10^, with A having 5 x 10^ nonzeros and the support of x* being of size 160,000. 



Agr-" 



\A\ 



k/n 


F(xk)-F' 
F(xo)-F* 


h^fc 


time [sec] 


0.0000 


10" 





0.0 


2.1180 


10-1 


880,056 


5.6 


4.6350 


10-2 


990,166 


12.3 


5.6250 


10-^' 


996,121 


15.1 


7.9310 


10-"* 


998,981 


20.7 


10.3920 


10-^ 


997,394 


27.4 


12.1100 


10-^ 


993,569 


32.3 


14.4640 


10-^ 


977,260 


38.3 


18.0720 


10"* 


847,156 


48.1 


19.5190 


10-9 


701,449 


51.7 


21.4650 


10-1° 


413,163 


56.4 


23.9150 


10-" 


210,624 


63.1 


25.1750 


10-12 


179,355 


66.6 


27.3820 


10-" 


163,048 


72.4 


29.9610 


10-1* 


160,311 


79.3 



= 5 ■ 10^ 








k/n 


F(xk)-F' 
F(xo)-F' 


a^fc 


time [sec] 


30.9440 


10-1^ 


160,139 


82.0 


32.7480 


10-" 


160,021 


86.6 


34.1740 


10-" 


160,003 


90.1 


35.2550 


io-i« 


160,000 


93.0 


36.5480 


10-" 


160,000 


96.6 


38.5210 


10-2« 


160,000 


101.4 


39.9860 


10-21 


160,000 


105.3 


40.9770 


10-22 


160,000 


108.1 


43.1390 


10-2^ 


160,000 


113.7 


47.2780 


10-2* 


160,000 


124.8 


47.2790 


10-25 


160,000 


124.8 


47.9580 


10-26 


160,000 


126.4 


49.5840 


10-2^ 


160,000 


130.3 


52.3130 


10-28 


160,000 


136.8 


53.4310 


10-29 


160,000 


139.4 



Table 6: Performance of UCDC on a sparse regression instance with a million variables. 

In both tables the first column corresponds to the "full-pass" iteration counter k/n. That is, 
after k = n coordinate iterations the value of this counter is 1, reflecting a single "pass" through 
the coordinates. The remaining columns correspond to, respectively, the size of the current residual 
F{xk) — F* relative to the initial residual F[xq) — F*, size H^fc ||o of the support of the current iterate 
X)^, and time (in seconds). A row is added whenever the residual initial residual is decreased by an 
additional factor of 10. 

Let us first look at the smaller of the two problems (Table [6]). After 35 x n coordinate iterations, 
UCDC decreases the initial residual by a factor of 10^^, and this takes about a minute and a half. 
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.leRio X 


'" ,\\A\\o = 2 


xlO^o 


k/n 


F{^„)-F' 


\M\o 


time [hours] 
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14,923,993 
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22,688,665 


4.25 


16 


10-^^ 


24,090,068 


22.65 



Table 7: Performance of UCDC on a sparse regression instance with a billion variables and 20 billion 
nonzeros in matrix A. 

Note that the number of nonzeros of x^ has stabilized at this point at 160, 000, the support size 
of the optima solution. The method has managed to identify the support. After 139.4 seconds the 
residual is decreased by a factor of 10^^. This surprising convergence speed can in part be explained 
by the fact that for random instances with m > n, f will typically be strongly convex, in which 
case UCDC converges linearly (Theorem [7]). 

UCDC has a very similar behavior on the larger problem as well (Table [7]). Note that A has 20 
billion nonzeros. In 1 x n iterations the initial residual is decreased by a factor of 10, and this takes 
less than an hour and a half. After less than a day, the residual is decreased by a factor of 1000. 
Note that it is very unusual for convex optimization methods equipped with iteration complexity 
guarantees to be able to solve problems of these sizes. 

Performance on fat matrices (m < n) 

When m < n, then / is not strongly convex and UCDC has the complexity 0(- log -) (Theorem d]). 
In Table [8] we illustrate the behavior of the method on such an instance; we have chosen m = 10^, 
n = 10^, Pllo = 10^ and ||x*||o = 1,600. Note that after the first 5,010 x n iterations UCDC 
decreases the residual by a factor of 10+ only; this takes less than 19 minutes. However, the 
decrease from 10^ to 10~^ is done in 15 x n iterations and takes 3 seconds only, suggesting very fast 
local convergence. 
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4,402.77 
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< 10^ 
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4,410.09 
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< 10' 


1,856 


4,411.04 
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<10° 


1,609 


4,411.63 


21,465 


< 10-1 


1,600 


4,412.21 


21,468 


< 10-2 


1,600 


4,412.79 


21,471 


< 10"^ 


1,600 


4,413.38 



Table 8: UCDC needs many more iterations when m < n, but local convergence is still fast. 
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Comparing different probability vectors 



Nesterov |13) considers only probabilities proportional to a power of the Lipschitz constants: 



Pi 






< Q < 1. 



(55) 



In Figure [T] we compare the behavior of RCDC, with the probability vector chosen according to the 
power law (|55|l . for three different values of a (0, 0.5 and 1). All variants of RCDC were compared 
on a single instance with m = 1,000, n = 2,000 and ||2;*||o = 300 (different instances produced by 
the generator yield similar results) and with A € {0, 1}. The plot on the left corresponds to A = 0, 
the plot on the right to A = 1. 
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Figure 1: Development of F{xk) 
(right). 



F* for sparse regression problem with A = (left) and A = 1 



Note that in both cases the choice a = 1 is the best. In other words, coordinates with large Lj 
have a tendency to decrease the objective function the most. However, looking at the A = case, we 
see that the method with a = 1 stalls after about 20,000 iterations. The reason for this is that now 
the coordinates with small Lj should be chosen to further decrease the objective value. However, 
they are chosen with very small probability and hence the slowdown. A solution to this could be to 
start the method with a = 1 and then switch to a = later on. On the problem with A = 1 this 
effect is less pronounced. This is to be expected as now the objective function is a combination of 
/ and ^, with ^ exerting its influence and mitigating the effect of the Lipschitz constants. 

Coordinate Descent vs. a Full-Gradient method 

In Figure [1] we compare the performance of RCDC with the full gradient (FG) algorithm |12| (with 
the Lipschitz constant Lpc = Amax(^^) > rnaxj Li) for four different distributions of the Lipschitz 
constants Li. Since the work performed during one iteration of FG is comparable with the work 
performed by UCDC during n coordinate iterations, for FG we multiply the iteration count by n. 
In all four tests we solve instances with A G j^2,000xi,000_ 

In the 1-1 plot the Lipschitz constants Lj were generated uniformly at random in the interval 
(0, 1). We see that the RCDC variants with a = and a = 0.2 exhibit virtually the same behavior, 
whereas a = 1 and FG struggle finding a solution with error tolerance below 10"^ and 10~^, 
respectively. The a = 1 method does start off a bit faster, but then stalls due to the fact that 
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the coordinates with small Lipschitz constants are chosen with extremely small probabilities. For a 
more accurate solution one needs to be updating these coordinates as well. 

In order to zoom in on this phenomenon, in the 1-2 plot we construct an instance with an 
extreme distribution of Lipschitz constants: 98% of the constants have the value 10~^, whereas the 
remaining 2% have the value 10'^. Note that while the FG and a = 1 methods are able to quickly 
decrease the objective function within 10~^ of the optimum, they get stuck afterwards since they 
effectively never update the coordinates with Li = 10~^. On the other hand, the a = method 
starts off slowly, but does not stop and manages to solve the problem eventually, in about 2 x 10^ 
iterations. 
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Figure 2: Comparison UCDC with different choices of a with a full-gradient method (which es- 
sentially is UCDC with one component: n = 1) for four different distributions of the Lipschitz 
constants Lj. 

In the plot in the 2-1 position (resp. 2-2 position) we choose 70% (resp. 50%) of the Lipschitz 
constants Li to be 1, and the remaining 30% (resp. 50%) equal to 100. Again, the a = and 
a = 0.2 methods give the best long-term performance. 

In summary, if fast convergence to a solution with a moderate accuracy us needed, then a = 1 
is the best choice (and is always better than FG). If one desires a solution of higher accuracy, it is 
recommended to switch to a = 0. In fact, it turns out that we can do much better than this using 
a "shrinking" heuristic. 
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Speedup by shrinking 



It is well-known that increasing values of A encourage increased sparsity in the solution of (|54p . In 
the experimental setup of this section we observe that from certain iteration onwards, the sparsity 
pattern of the iterates of RCDC is a very good predictor of the sparsity pattern of the optimal 
solution X* the iterates converge to. More specifically, we often observe in numerical experiments 
that for large enough k the following holds: 



(-? 



.(0 



Ui) 



0) ^ (V/ > k x\'> = {x*Y> = 0). (56) 

In words, for large enough k, zeros in Xk typically stay zeros in all subsequent iterate^j and corre- 
spond to zeros in x* . Note that RCDC is not able to take advantage of this. Indeed, RCDC, as 
presented in the theoretical sections of this paper, uses the fixed probability vector p to randomly 
pick a single coordinate i to be updated in each iteration. Hence, eventually, ^. {i)_„Pi proportion 

k 

of time will be spent on vacuous updates. 

Looking at the data in Table [6] one can see that after approximately 35 x n iterations, x^ has 
the same number of non-zeros as x* (160,000). What is not visible in the table is that, in fact, 
the relation (j56p holds for this instance much sooner. In Figure [3] we illustrate this phenomenon in 
more detail on an instance with m = 500, n = 1,000 and ||x*||o = 100. 
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Figure 3: Development of non-zero elements in x\^. 

First, note that the number oj nonzeros (solid blue line) in the current iterate, #{i : x^ 7^ 0}, 
is first growing from zero (since we start with xq = 0) to just below n in about 0.6 x 10^ iterations. 
This value than starts to decrease starting from about k ~ 15ri and reaches the optimal number of 
nonzeros at iteration k ~ 30n and stays there afterwards. Note that the number of correct nonzeros^ 



CUk 



Si) 



#{i:x^VO&(x*)W/0}, 



(i) 



is increasing (for this particular instance) and reaches the optimal level ||x*||o very quickly (at 
around k ~ 3n). An alternative, and perhaps a more natural, way to look at the same thing is via 
the number of incorrect zeros, 



IZk 



*{i ■■ xf 



0&(x*)« ^0}. 



^There are various theoretical results on the identification of active manifolds explaining numerical observations 
of this type; see [7] and the references therein. See also |28) . 
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Indeed, we have crik + izk = ||2;*||o- Note that for our problem izk ~ for k > k^ ^ 3n. 

The above discussion suggests that an iterate- dependent pohcy for updating of the probabihty 
vectors pk in Algorithm [¥] might help to accelerate the method. Let us now introduce a simple 
q-shrinking strategy for adaptively changing the probabilities as follows: at iteration k > ko, where 
ko is large enough, set 



Pk 



,(i) 



Pk il) 



def 



if X 



(0 



0, 



+ 



F'fellO ' 



otherwise. 



This is equivalent to choosing i^ uniformly from the set {1, 2, . . . , n} with probability 1 — q and 
uniformly from the support set of x^ with probability q. Clearly, different variants of this can be 
implemented, such as fixing a new probability vector for k > k^ (as opposed to changing it for 
every k) ; and some may be more effective and/or efficient than others in a particular context. In 
Figure[4]we illustrate the effectiveness of g-shrinking on an instance of size m = 500, n = 1, 000 with 
||a;*||o = 50. We apply to this problem a modified version of RCDC started from the origin {xq = 0) 
in which uniform probabilities are used in iterations 0, . . . , feg — 1, and g-shrinking is introduced as 
of iteration k^: 

for k = 0,1, . . . ,kQ — 1, 

for k > ko- 




Pk i^), 



We have used ko = 5 x n. 
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Figure 4: Comparison of different shrinking strategies. 



Notice that as the number of nonzero elements of x^ decreases, the time savings from (/-shrinking 
grow. Indeed, 0.9-shrinking introduces a saving of nearly 70% when compared to 0-shrinking to 
obtain x^ satisfying F{xk) — F* < 10~^^. We have repeated this experiment with two modifications: 
a) a random point was used as the initial iterate (scaled so that ||a;o||o = n) and b) /cq = 0. The 
corresponding plots are very similar to Figure U] with the exception that the lines in the second plot 
start from ||xo||o = n. 



Choice of the initial point 

Let us now investigate the question of the choice of the initial iterate xq for RCDC. Two choices 
seem very natural: a) xq = (the minimizer of $(x) = A||x|[i) and b) xq = xls (the minimizer of 
f{x) = \\\-A.x — hW^). Note that the computation of x^s may be as complex as the solution of the 
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original problem. However, if available, xls constitutes a reasonable alternative to 0: intuitively, 
the former will be preferable to the latter whenever ^ is dominated by /, i.e., when A is small. 

In Figure [5] we compare the performance of UCDC run on a single instance when started from 
these two starting points (the solid line corresponds to xq = whereas the dashed line corresponds 
to xq = Xls)- The same instance is used here as in the g-shrinking experiments and A = 1. Starting 
from Xls gives a 4x speedup for pushing residual F{xk) — F* below 10~^. 
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Figure 5: Starting from the least squares solution, if available, and if A is small enough, can be 
better than starting from the origin. 

In Figure [6] we investigate the effect of starting UCDC from a point on the line segment between 
Xls (dashed red line) and (solid blue line). We generate 50 such points xq, uniformly at random 
(thin green lines). The plot on the left corresponds to the choice A = 0.01, the plot on the right to 
A = 1. Note that x* = xls for A = and x* = when A — )■ oo. 
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Figure 6: There does not seem to be any advantage in starting UCDC from a point on the line 
segment between xls ^^d the origin as opposed to starting it from the better of two endpoints. 



6.2 Linear Support Vector Machines 

Consider the problem of training a linear classifier with training examples {(xi, yi), . . . , (x^, 2/»n)}) 
where Xi axe the feature vectors and yi G { — 1,+!} the corresponding labels (classes). This problem 
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is usually cast as an optimization problem of the form (JT]) , 



min F{w) = f{w) + ^{w), 



(57) 



where 



f{w) =-i^C{w; 



^i 1 Ui 



i=l 



£ is a nonnegative convex loss function and '!'(•) = || • ||i for Ll-regularized and ^(•) = || • ||2 for 
L2-regularized linear classifier. Some popular loss functions are listed in Table [9j For more details 
we refer the reader to [28] and the references therein; for a survey of recent advances in large-scale 
linear classification see 1291. 



£.{w;xi,yi) 



max{0, 1 — yjW^ Xj} 
max{0, 1 — yjuFxj}"^ 

logfl + e-2^^"''^^0 



name 



Ll-SVM loss (Ll-SVM) 

L2-SVM loss (L2-SVM) 

logistic loss (LG) 



property 



C^ continuous 
C^ continuous 
C^ continuous 



Table 9: A list of a few popular loss functions. 

Because our setup requires / to be at least C"^ continuous, we will consider the L2-SVM and 
LG loss functions only. In the experiments below we consider the LI regularized setup. 

A few implementation remarks 

The Lipschitz constants and coordinate derivatives of / for the L2-SVM and LG loss functions are 
listed in Table [TOl 



Loss function 


U 






v./(^) 




L2-SVM 
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-27 




Y. y.-?C^- 


- VjW^Xj) 




i=i 




J : - 


-yjwTxj>-l 




LG 


m 




-7 





Table 10: Lipschitz constants and coordinate derivatives for SVM. 



For an efficient implementation of UCDC we need to be able to cheaply update the partial 
derivatives after each step of the method. If at step k coordinate i gets updated, via Wk+i = Wk + tCi, 
yjuFxj for J = 1, . . . , m, then 



and we let r\. — "■"■•^ 



(i) (i) (i) 



(58) 



Let Oi be the number of observations feature i appears in, i.e., Oj = ^{j : x^ ^ 0}. Then the 
update (|58p . and consequently the update of the partial derivative (see Table [TOj) . requires 0{oi) 
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operations. In particular, in feature- sparse problems where - Y17=i '^i ^ "^' ^^ average iteration of 
UCDC will be very cheap. 

Small scale test 

We perform only preliminary results on the dataset rcvl .binarjQ. This dataset has 47,236 features 
and 20,242 training and 677,399 testing instances. We train the classifier on 90% of training instances 
(18,217); the rest we used for cross-validation for the selection of the parameter 7. In Table [TT] we 
list cross-validation accuracy (CV-A) for various choices of 7 and testing accuracy (TA) on 677,399 
instances. The best constant 7 is 1 for both loss functions in cross-validation. 



Loss function 


7 


CV-A 


TA 


7 


CV-A 


TA 


L2-SVM 


0.0625 


94.1% 


93.2% 


2 


97.0% 


95.6% 




0.1250 


95.5% 


94.5% 


4 


97.0% 


95.4% 




0.2500 


96.5% 


95.4% 


8 


96.9% 


95.1% 




0.5000 


97.0% 


95.8% 


16 


96.7% 


95.0% 




1.0000 


97.0% 


95.8% 


32 


96.4% 


94.9% 


LG 


0.5000 


0.0% 


0.0% 


8 


40.7% 


37.0% 




1.0000 


96.4% 


95.2% 


16 


37.7% 


36.0% 




2.0000 


43.2% 


39.4% 


32 


37.6% 


33.4% 




4.0000 


39.3% 


36.5% 


64 


36.9% 


34.1% 



Table 11: Cross validation accuracy (CV-A) and testing accuracy (TA) for various choices of 7. 

In Figure [7] we present dependence of TA on the number of iterations we run UCDC for (we 
measure this number in multiples of n). As you can observe, UCDC finds good solution after 10 x n 
iterations, which for this data means less then half a second. Let us remark that we did not include 
bias term or any scaling of the data. 

Large scale test 

We have used the dataset kdd2010 (bridge to algebrajl, which has 29,890,095 features and 19,264,097 
training and 748,401 testing instances. Training the classifier on the entire training set required 
approximately 70 seconds in the case of L2-SVM loss and 112 seconds in the case of LG loss. We 
have run UCDC for n iterations. 
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